In the field of instrumental analytical chemistry, there are many techniques for identifying unknown substances. The field is primarily researched for the detection of explosives, drugs, and chemical weapons, but it has many other applications; in healthcare, for example, researchers use these techniques to detect cancer and other diseases. Chromatography is a method by which a mixture is separated into its components, spread across time. For example, if you sample a mixture made up of multiple substances, you can separate those substances and detect them individually with chromatography. One chromatographic method uses what is called a column. This device separates out the components of a mixture, but it is extremely slow, taking minutes to work.
Mass spectrometry is the technique of identifying the chemical structure of a substance by separating its ions by mass and charge. The three essential components of a mass spectrometer are the ion source, the mass analyzer, and the detector. The ion source ionizes the sampled substance, the analyzer sorts and separates the ions by mass and charge, and the detector measures or counts the separated ions. Results can be displayed as a histogram of mass-to-charge ratio (m/z), or across time as line curves whose peaks mark where the maximum quantity of ions was detected. There are many different techniques for each component in performing the identification of substances.
The most popular analytical technique today is ion-mobility spectrometry (IMS). This technique separates and identifies ionized molecules based on their mobility within a carrier gas. It is extremely fast, taking milliseconds to achieve a result, but it is less sensitive than other techniques. It is popular because an IMS device can be made at relatively low cost compared to other techniques and can be small enough to be hand-held.
The final technique we will discuss is triple quadrupole mass spectrometry (TQMS). This is a tandem mass spectrometry technique, which means it has two mass analyzers. The components, in order, are the ion source; a quadrupole mass analyzer; a quadrupole that acts as a collision cell to fragment the molecules entering it; a second analyzer that analyzes the resulting fragments; and finally the detector. A quadrupole works by creating an electromagnetic field that separates the ions and makes them follow a trajectory based on their mass-to-charge ratio (m/z). In theory this technique is the most sensitive, and it achieves results in seconds, but these devices tend to be very expensive and large. This is the device a team and I are working on.
The techniques discussed above can be combined to solve problems depending on the application. Each technique carries trade-offs; cost and weight are always major factors, and in some cases the science is not well understood.
My team and I are currently working on a triple quadrupole mass spectrometer that is cheaper and smaller, so that we can address new markets and applications that TQMS could not reach previously. Our current instrument displays mass-to-charge ratios over time. In the past we used peak thresholds to determine what counts as a detected substance, but that technique only gives us an accuracy of 40% with a very high rate of false positives. We are trying to achieve an accuracy of 90% with no more than a 2% false positive rate. We have discussed adding filtering techniques, but there would be a trade-off in time and cost, and we need to complete our analysis in under 10 seconds. Ideally, we can solve the problem purely with algorithms. First, I would like to see if we can use our existing approach of classifying compounds based on peak features of relevant mass pairs; we need to assess which peak features distinguish our detections from noise. If that does not work, I should be able to use a 1D CNN to learn the mass pair intensity shapes.
The datasets used in this project were generated from samples collected on our instrument. The instrument was sent out for testing, and 12 different substances were tested; the data files we got back are the results of that testing, and the datasets are generated from those files. The datasets have been modified to abstract out any sensitive details such as substance names and mass pairs. Most importantly, the intensities are all generated to mimic the shape of the collected data. The data has also been filtered to remove anything malformed by a hardware or other error. There is no proprietary data associated with this project. The model that will be built will need to be re-trained on the actual proprietary dataset to work with our instrument, but the generated data should be more than adequate for evaluating a model. In most cases, I have between 50 and 80 samples for each substance I am testing for. I realize this may not be enough, but I am also trying to gauge how many samples will be needed when extending the substance library. If a compound performs poorly from lack of samples, I will remove it from the test.
Each data file consists of multiple components. First, there is a mass pair transition id. A mass pair transition consists of an ion charge (+ or -), a parent mass, a daughter mass, and a collision energy, e.g. +123->456(78). The ion charge comes from the ion source, the parent mass from the first quad, the daughter mass from the third quad, and the collision energy is applied at the second quad. Instead of seeing this transition you will see a number 0 to n-1, where n is the total quantity of specified transitions (e.g. n=51). After the id there is a sample id associated with the dataset, a comment field specifying whether another substance was combined with the tested-for substance, and the substrate it was sampled on. Substrate can have the value direct, or a harvest code like Perf5. Direct means the substance was inserted into the instrument through a syringe, and should give the most stable result. If the substance was harvested off material Perf5, the substance was applied to the material and then swiped off with one of our swabs. Theoretically, a harvested substance should measure fewer ions than direct because you may not have collected the entire amount of the substance. After the substrate field there are a detection field and an association field. The detection field holds an array of numbers specifying which compounds are detected within that dataset. The association field specifies which mass pairs are associated with which compounds according to our chemist; for example, mass pair 1 is associated with compounds [1,3]. After the association field there are the height, width, area, and position of the mass pair peak; these values are acquired by applying a smoothing filter to reduce the noise of the signal. Finally, there is a time series of intensities over a specified number of time steps or scans (e.g. 23).
For now, the scan count is fixed at 23; in the future it would be better to make the scan count variable in case we want to stop early. So each data file has that structure per transition row, for example 51x33.
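To make the transition notation above concrete, a small helper can unpack a string like `+123->456(78)` into its parts. This parser and its field names are hypothetical, for illustration only; in the actual data files the transition is replaced by a numeric id.

```python
import re

def parse_transition(transition: str):
    """Split a mass pair transition like '+123->456(78)' into
    ion charge, parent mass, daughter mass, and collision energy."""
    match = re.fullmatch(r'([+-])(\d+)->(\d+)\((\d+)\)', transition)
    if match is None:
        raise ValueError("Unrecognized transition: " + transition)
    charge, parent, daughter, energy = match.groups()
    return {
        'charge': charge,                 # ion polarity from the ion source
        'parent_mass': int(parent),       # selected by the first quadrupole
        'daughter_mass': int(daughter),   # selected by the third quadrupole
        'collision_energy': int(energy),  # applied at the second quadrupole
    }

print(parse_transition('+123->456(78)'))
```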
# Install required packages in the current Jupyter kernel
import sys
#Used with Anaconda 64 3.7
#run as admin
!"{sys.executable}" -m pip install -U pip
!"{sys.executable}" -m pip install -U matplotlib
!"{sys.executable}" -m pip install -U scikit-learn
!"{sys.executable}" -m pip install -U setuptools
!"{sys.executable}" -m pip install joblib
!"{sys.executable}" -m pip install tensorflow
!"{sys.executable}" -m pip install tensorflow-gpu
!"{sys.executable}" -m pip install keras
!"{sys.executable}" -m pip install scipy
#Load datasets
import pandas as pd
import glob, os
os.chdir("./data")
samples = []
count = 0
max_mass_pair_count = 0
for file in glob.glob("*.csv"):
    data_frame = pd.read_csv(file)
    count += 1
    mass_pair_count = len(data_frame)
    max_mass_pair_count = max(mass_pair_count, max_mass_pair_count)
    samples.append(data_frame)
os.chdir("..")
merged_data_set = pd.concat(samples, ignore_index=True)
print("Number of Samples: ", count)
print("Max number of mass pairs: ", max_mass_pair_count)
#multi detection
display(merged_data_set.head(1))
#single detection row
display(samples[10].head(1))
# no detection
display(samples[1].head(1))
results = (merged_data_set.groupby(['detection']).count()/max_mass_pair_count)['mass_pair_id']
results.index.name = 'Compounds'
results.index = sorted(results.index)
results.name = 'Sample count per compound'
results.round() # in case sample file has extra mass pair
Compounds 7 and 19 do not have enough samples. I will perform the data analysis anyway and consider dropping them later.
The model can only be benchmarked against the previous solution, which yields a total accuracy of 40%. Minimum peak height was the only threshold, set manually based on internal lab testing; for example, compound 1 could have a minimum height threshold of 1600 ion counts. If the peak signal was below this amount, it was deemed noise and ignored. If it was above, additional logic required all associated mass pairs to be above their limits before raising a substance detection alert. Below is most of a confusion matrix, but TNR is missing.
| Compound ID | TPR | FPR | FNR |
|---|---|---|---|
| Compound 3 | 5.00% | 4.55% | 95.00% |
| Compound 4 | 30.34% | 52.81% | 32.58% |
| Compound 7 | 0.00% | 87.50% | 12.50% |
| Compound 8 | 13.33% | 20.00% | 66.67% |
| Compound 10 | 12.31% | 7.69% | 86.15% |
| Compound 13 | 0.00% | 21.21% | 84.85% |
| Compound 14 | 5.00% | 0.00% | 95.00% |
| Compound 15 | 20.37% | 12.96% | 68.52% |
| Compound 18 | 59.65% | 28.95% | 14.91% |
| Compound 19 | 20.00% | 70.00% | 50.00% |
| Compound 21 | 72.90% | 26.17% | 0.93% |
| Compound 22 | 70.27% | 17.12% | 15.32% |
Accuracy: ~40%
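The baseline rule described above (every associated mass pair must clear its minimum height) can be sketched as follows. The thresholds and compound-to-mass-pair associations here are made-up examples; the real per-compound limits came from internal lab testing.

```python
# Hypothetical minimum peak height thresholds (ion counts) per mass pair id.
min_height = {0: 1600, 1: 1200, 2: 900}
# Hypothetical compound id -> associated mass pair ids.
compound_mass_pairs = {1: [0, 1], 2: [2]}

def detect(peak_heights, compound_mass_pairs=compound_mass_pairs, min_height=min_height):
    """Raise a detection alert for a compound only when every one of its
    associated mass pair peaks meets its minimum height threshold."""
    detections = []
    for compound, mass_pairs in compound_mass_pairs.items():
        if all(peak_heights.get(mp, 0) >= min_height[mp] for mp in mass_pairs):
            detections.append(compound)
    return detections

print(detect({0: 1700, 1: 1300, 2: 100}))  # compound 1 passes both limits -> [1]
```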
To evaluate our models, we should use a weighted accuracy metric such as fbeta_score, plus a confusion matrix to monitor our false positive rate. According to our requirements, we can miss a detection 10% of the time as long as the false positive rate stays below 2%. Since it is more important to be precise than to have high recall, we should set beta to 0.5.
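For instance, with scikit-learn (the labels below are illustrative, not instrument data):

```python
from sklearn.metrics import fbeta_score, confusion_matrix

y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 1]

# beta=0.5 weights precision more heavily than recall.
score = fbeta_score(y_true, y_pred, beta=0.5)
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(score, fp / (fp + tn))  # F0.5 ~ 0.667, FPR = 0.2
```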
I need to build a compound-to-mass-pair lookup table in order to find the mass pairs that are important.
import numpy as np

# All samples have the same association per mass pair.
def string_to_list_of_int(string):
    """Parse a string like '[1,3]' into a list of ints."""
    mod_str = string.replace('[', '').replace(']', '').split(',')
    return list(map(int, mod_str))

compounds_to_detection_data_set = merged_data_set.groupby(['detection']).groups
compound_id_to_mass_pair_ids = dict()
for compounds, detection_indices in sorted(compounds_to_detection_data_set.items()):
    if compounds == 'None':
        continue
    compound_ids = string_to_list_of_int(compounds)
    compound_dataset = merged_data_set.iloc[detection_indices]
    compound_dataset = compound_dataset.iloc[:max_mass_pair_count]
    for i, row in compound_dataset.iterrows():
        detected_compounds = set(compound_ids)
        association = set(string_to_list_of_int(row['association']))
        associated_compounds = detected_compounds.intersection(association)
        for associated_compound in associated_compounds:
            if associated_compound not in compound_id_to_mass_pair_ids:
                compound_id_to_mass_pair_ids[associated_compound] = []
            if row['mass_pair_id'] not in compound_id_to_mass_pair_ids[associated_compound]:
                compound_id_to_mass_pair_ids[associated_compound].append(row['mass_pair_id'])
compound_id_to_mass_pair_ids[-1] = np.arange(max_mass_pair_count).tolist()  # -1 represents noise
compound_id_to_mass_pair_ids = dict(sorted(compound_id_to_mass_pair_ids.items()))
compound_count = max(compound_id_to_mass_pair_ids.keys()) + 1
compound_id_to_mass_pair_ids
Explore the peak properties first. If the data ends up being Gaussian, then we can just use a non-parametric solution. We will need to scale the values because they are disproportionate and large across feature columns. I need to see a description of the data by mass pair id and compound so I can determine which features look the most promising.
peak_properties = ['peak_height', 'peak_width', 'peak_area', 'peak_position']
compounds_to_detection_data_set = merged_data_set.groupby(['detection']).groups
mass_pair_id_to_group_indices = merged_data_set.groupby(['mass_pair_id']).groups
mass_pair_id_to_compound_ids_to_dataset = dict()
for compounds, detection_indices in sorted(compounds_to_detection_data_set.items()):
    if compounds == 'None':
        compound_ids = [-1]
    else:
        compound_ids = string_to_list_of_int(compounds)
    compound_indices = compounds_to_detection_data_set[compounds]
    for compound in compound_ids:
        for mass_pair_id in compound_id_to_mass_pair_ids[compound]:
            mass_pair_indices = mass_pair_id_to_group_indices[mass_pair_id]
            # intersection of group indices
            intersection_indices = list(set(mass_pair_indices).intersection(set(compound_indices)))
            dataset = merged_data_set.iloc[intersection_indices]
            if mass_pair_id not in mass_pair_id_to_compound_ids_to_dataset:
                mass_pair_id_to_compound_ids_to_dataset[mass_pair_id] = dict()
            if compound not in mass_pair_id_to_compound_ids_to_dataset[mass_pair_id]:
                mass_pair_id_to_compound_ids_to_dataset[mass_pair_id][compound] = dataset
            else:
                # keep the concatenated result
                mass_pair_id_to_compound_ids_to_dataset[mass_pair_id][compound] = pd.concat(
                    [mass_pair_id_to_compound_ids_to_dataset[mass_pair_id][compound], dataset])

for mass_pair_id, compound_id_to_dataset in mass_pair_id_to_compound_ids_to_dataset.items():
    for compound_id, dataset in compound_id_to_dataset.items():
        if compound_id == -1:
            continue
        print("Mass pair ID: ", mass_pair_id)
        print("Compound id: ", compound_id)
        display(dataset[peak_properties].describe())
After seeing a description of the data, width and position appear to be the most stable features, with the least variance. There also seems to be some correlation between height and area, and between width and position. I need to test this so that I can remove the correlated features and improve my future model's accuracy. I'm going to look at all the data first, then break it down.
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
peak_props_all_data = merged_data_set[peak_properties]
sns.heatmap(peak_props_all_data.corr())
# Produce a scatter matrix for each pair of features in the data
pd.plotting.scatter_matrix(peak_props_all_data, alpha=0.3, diagonal='kde')
plt.show()
You can tell there is a strong correlation between peak height and area, which makes sense because the area under our peak is calculated from the height. Let's remove area and see whether any other correlations appear once we break the data down into the important compound-to-mass-pair associations. I will use a heatmap and scatter matrix to determine correlations and the distribution of the data for each feature.
peak_properties = ['peak_height', 'peak_width', 'peak_position']
for mass_pair_id, compound_id_to_dataset in mass_pair_id_to_compound_ids_to_dataset.items():
    for compound_id, dataset in compound_id_to_dataset.items():
        if compound_id == -1:
            continue
        print("Mass pair ID: ", mass_pair_id)
        print("Compound id: ", compound_id)
        peak_props = dataset[peak_properties]
        sns.heatmap(peak_props.corr())
        # Produce a scatter matrix for each pair of features in the data
        pd.plotting.scatter_matrix(peak_props, alpha=0.3, diagonal='kde')
        plt.show()
In most cases there is no other correlation. Some of the heatmaps show a little, but I think it could just be due to a lack of samples. Some of the scatter plots show a shifted Gaussian curve; in other cases the distribution looks non-Gaussian. I am going to apply a log transform to see if I can make the data more Gaussian.
import numpy as np
for mass_pair_id, compound_id_to_dataset in mass_pair_id_to_compound_ids_to_dataset.items():
    for compound_id, dataset in compound_id_to_dataset.items():
        if compound_id == -1:
            continue
        print("Mass pair ID: ", mass_pair_id)
        print("Compound id: ", compound_id)
        peak_props = dataset[peak_properties]
        sns.heatmap(peak_props.corr())
        # Scatter matrix of the log-transformed features (zeros map to -inf, so clean them up)
        pd.plotting.scatter_matrix(np.log(peak_props).replace([np.inf, -np.inf], np.nan).fillna(0),
                                   alpha=0.3, diagonal='kde')
        plt.show()
In some cases the transformation works well; in most cases it does not. It seems that either I do not have enough samples to represent the total population, or the current features are not deterministic enough to identify detections.
For the most part, width seems to be a very stable and independent feature. Position and height, though not correlated, tend to mimic each other, so I can probably get away with using just one of them. I will try position because it previously had the smallest standard deviation. Some mass-pair-to-compound pairings are much more stable than others. In all cases there are some outliers; depending on the algorithm I use, I may need to remove them to avoid overfitting my model. Next, I want to plot and label all my mass pair data to get an idea of how well my algorithm will do on test data.
def find_outliers(gaussian_data: pd.DataFrame):
    outliers = []
    common_outliers = []
    # For each feature find the data points with extreme high or low values
    for feature in gaussian_data.keys():
        # Calculate Q1 (25th percentile of the data) for the given feature
        Q1 = np.percentile(gaussian_data[feature], 25)
        # Calculate Q3 (75th percentile of the data) for the given feature
        Q3 = np.percentile(gaussian_data[feature], 75)
        # Use the interquartile range to calculate an outlier step (1.5 times the IQR)
        step = (Q3 - Q1) * 1.5
        data = gaussian_data[~((gaussian_data[feature] >= Q1 - step) &
                               (gaussian_data[feature] <= Q3 + step))]
        # Track points that are outliers in one feature, and those outlying in several
        for index, _ in data.iterrows():
            if index not in outliers:
                outliers.append(index)
            elif index not in common_outliers:
                common_outliers.append(index)
    return outliers
def plot_with_labels(mass_pair_id, compound_id_to_dataset, peak_properties, create_model=None):
    data = []
    outliers = []
    groups = []
    xColumnLabel, yColumnLabel = peak_properties
    for compound_id, dataset in compound_id_to_dataset.items():
        dataset = dataset[peak_properties]
        if compound_id == -1:
            outliers.append(None)
            groups.append('Noise')
        else:
            found_outliers_indices = find_outliers(dataset)
            outliers.append(dataset.loc[found_outliers_indices])
            groups.append('Compound {}'.format(compound_id))
        data.append(dataset)
    if len(groups) <= 1:
        return
    # Create plot
    fig = plt.figure()
    ax = fig.add_subplot(111)
    title = 'Mass Pair ID: {}'.format(mass_pair_id)
    ax.set_xlabel(xColumnLabel)
    ax.set_ylabel(yColumnLabel)
    dataset = None
    for datum, group, outlier in zip(data, groups, outliers):
        if dataset is None:
            dataset = datum.assign(output=-1)  # Noise is first
        else:
            output = int(group.split()[-1])
            temp = datum.append(outlier)
            temp = temp.assign(output=output)
            dataset = dataset.append(temp)
    X = dataset[peak_properties].values
    y = dataset['output'].values
    # fit the model and draw its decision regions, if a model factory was given
    if create_model is not None:
        model = create_model()
        h = .02  # step size in the mesh
        # create a mesh to plot in
        x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
        y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
        xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                             np.arange(y_min, y_max, h))
        model.fit(X, y)
        Z = model.predict(np.c_[xx.ravel(), yy.ravel()])
        # Put the result into a color plot
        Z = Z.reshape(xx.shape)
        plt.contourf(xx, yy, Z, cmap=plt.cm.tab10, alpha=0.8)
    # Plot also the training points
    scatter = ax.scatter(X[:, 0], X[:, 1], c=y, cmap=plt.cm.tab10)
    plt.xticks(())
    plt.yticks(())
    # produce a legend with the unique colors from the scatter
    legend = ax.legend(*scatter.legend_elements(), bbox_to_anchor=(1, 1), title="Compound ID")
    ax.add_artist(legend)
    plt.title(title)
for mass_pair_id, compound_id_to_dataset in mass_pair_id_to_compound_ids_to_dataset.items():
    plot_with_labels(mass_pair_id, compound_id_to_dataset, ['peak_width', 'peak_position'])
    plt.show()
Some mass pairs look better than others. Let me see how height looks instead of position.
for mass_pair_id, compound_id_to_dataset in mass_pair_id_to_compound_ids_to_dataset.items():
    plot_with_labels(mass_pair_id, compound_id_to_dataset, ['peak_width', 'peak_height'])
    plt.show()
In most cases, height looks a lot worse graphically than position as a feature, and in both cases the features seem less than ideal. Let's apply a polynomial support vector machine (SVM) to the data. My hope is to create a very general classification that, combined across all the associated mass pairs, gives good results. Based on the data, it appears we could draw circles around pieces of the data, so we could use an SVM with either a polynomial or RBF kernel. RBF may overfit, so I will first try polynomial and review the results. I will graph the classifications so I can see how well the algorithm fits per mass pair.
There are multiple reasons I would choose an SVM over other algorithms. In most cases, the data is not linearly separable, which rules out logistic regression; a decision tree would probably overfit the data. Another problem is that most of the data is not Gaussian, so I must be wary of creating classification boundaries that are too tight and overfit the data. SVMs resist overfitting by design because they try to maximize the margin between the classes. By using a polynomial SVM I can transform the existing data into a higher dimension, resulting in data that is more separable than before. This idea is called the kernel trick: in simplified terms, it works by taking inner products of the data as if it were mapped into a higher-dimensional space, without computing that mapping explicitly, hopefully creating a distinct decision boundary.
from sklearn.svm import SVC
C = 1.0  # SVM regularization parameter
for mass_pair_id, compound_id_to_dataset in mass_pair_id_to_compound_ids_to_dataset.items():
    plot_with_labels(mass_pair_id, compound_id_to_dataset, ['peak_width', 'peak_position'],
                     lambda: SVC(kernel='poly', degree=3, gamma='scale', C=C))
    plt.show()
After reviewing each mass pair graph, it appears that in some cases we can distinguish quite well, while in others we cannot. Looking at mass pair id 46, the regions drawn are quite elaborate. This is very concerning because it could indicate that the current feature set will not scale well if we expand our compound library. Peak shape may be a possible distinguisher; I will have to use a CNN on the intensities to see if they can be classified accordingly.
As previously discussed, our benchmark will be a true positive rate of greater or equal to 90% and a false positive rate less than or equal to 2%.
#reminder of what data looks like
display(merged_data_set.head(1))
max_timestep = int(merged_data_set.columns[-1].split('_')[-1])
I want to build a data structure that lets me easily switch between different algorithm prototypes. It will be a dictionary of compound_id to associated mass pairs, where each associated mass pair has a lookup of its timestep intensity values to be passed into its respective model. I do not need to pad the time steps because they are currently all the same length. We will investigate later whether smoothing the timestep intensities has any impact. We will normalize the timesteps when passing them into the model: take the max intensity across all mass pairs relevant to a compound and divide all intensities by that max value, so all intensity values fall between 0 and 1. This normalization retains the shape of our intensities while scaling the values down for faster learning.
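A minimal sketch of that normalization, assuming a (mass_pairs, timesteps) array per compound (the sample values below are made up):

```python
import numpy as np

def normalize_sample(intensities):
    """Scale a (mass_pairs, timesteps) intensity array into [0, 1] by dividing
    by the max intensity across all of the compound's mass pairs, which
    preserves the relative peak shapes between mass pairs."""
    max_intensity = np.amax(intensities)
    if max_intensity == 0:
        return np.zeros_like(intensities, dtype=float)
    return intensities / max_intensity

sample = np.array([[0, 400, 800, 200],
                   [0, 100, 200, 50]], dtype=float)
print(normalize_sample(sample))  # peak of 800 becomes 1.0, 400 becomes 0.5, etc.
```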
def get_compound_ids_to_mass_pair_ids_to_dataset(merged_data_set, compound_id_to_mass_pair_ids):
    merged_data_set = merged_data_set.copy()
    compounds_to_detection_data_set = merged_data_set.groupby(['detection']).groups
    mass_pair_id_to_group_indices = merged_data_set.groupby(['mass_pair_id']).groups
    timestep_columns = merged_data_set.columns[10:]
    compound_ids_to_mass_pair_ids_to_dataset = dict()
    for compounds, compound_detection_indices in sorted(compounds_to_detection_data_set.items()):
        if compounds == 'None':
            continue
        compound_ids = string_to_list_of_int(compounds)
        for compound_id in compound_ids:
            if compound_id not in compound_id_to_mass_pair_ids:
                continue  # skip
            if compound_id not in compound_ids_to_mass_pair_ids_to_dataset:
                compound_ids_to_mass_pair_ids_to_dataset[compound_id] = dict()
            for mass_pair_id in compound_id_to_mass_pair_ids[compound_id]:
                mass_pair_indices = mass_pair_id_to_group_indices[mass_pair_id]
                # intersection of group indices
                detection_indices = list(set(mass_pair_indices).intersection(set(compound_detection_indices)))
                no_detection_indices = list(set(mass_pair_indices) - set(detection_indices))
                detection_dataset = merged_data_set.iloc[detection_indices][timestep_columns]
                no_detection_dataset = merged_data_set.iloc[no_detection_indices][timestep_columns]
                detection_dataset = detection_dataset.assign(detection=compound_id)
                no_detection_dataset = no_detection_dataset.assign(detection=-1)
                dataset = detection_dataset.append(no_detection_dataset)
                if mass_pair_id not in compound_ids_to_mass_pair_ids_to_dataset[compound_id]:
                    compound_ids_to_mass_pair_ids_to_dataset[compound_id][mass_pair_id] = dataset
                else:
                    # keep the concatenated result
                    compound_ids_to_mass_pair_ids_to_dataset[compound_id][mass_pair_id] = pd.concat(
                        [compound_ids_to_mass_pair_ids_to_dataset[compound_id][mass_pair_id], dataset])
            compound_ids_to_mass_pair_ids_to_dataset[compound_id] = dict(
                sorted(compound_ids_to_mass_pair_ids_to_dataset[compound_id].items()))
    compound_ids_to_mass_pair_ids_to_dataset = dict(sorted(compound_ids_to_mass_pair_ids_to_dataset.items()))
    return compound_ids_to_mass_pair_ids_to_dataset
compound_ids_to_mass_pair_ids_to_dataset = get_compound_ids_to_mass_pair_ids_to_dataset(merged_data_set, compound_id_to_mass_pair_ids)
for compound_id in compound_ids_to_mass_pair_ids_to_dataset.keys():
    print("Compound ", compound_id)
    print("Mass pairs", list(compound_ids_to_mass_pair_ids_to_dataset[compound_id].keys()))
I will build a temporal CNN using 1D convolution layers. I could build one using 2D layers, but I think we can simplify the model by using just the intensities without absolutely accurate time coordinates.
Model design: our current data follows the format mass pairs to intensities (timesteps), while our model needs timesteps to mass pairs. For ease of use, we allow the data to be passed in its original format and permute it to reformat it for our needs. Next, we follow the pattern of two convolutional layers followed by a pooling layer to extract and condense features. Two iterations of this design should yield enough features to determine whether there are consistent patterns that allow for accurate classification. I finish the model with a dropout layer so as not to overfit our data.
Since I want one model per compound id, the last layer is a Dense layer with a single unit and a sigmoid activation. If we wanted a multilabel model, the final layer would have one unit per class and a softmax activation.
from keras.models import Model
from keras.layers import Dense, Conv1D, \
    MaxPooling1D, GlobalAveragePooling1D, Dropout, \
    Input, Permute

def build_compound_model(mass_pair_count, max_timestep, print_summary=True):
    inputs = Input(shape=(mass_pair_count, max_timestep))
    # reorder from (mass_pairs, timesteps) to (timesteps, mass_pairs)
    x = Permute((2, 1))(inputs)
    x = Conv1D(128, 5, activation='relu')(x)
    x = Conv1D(128, 5, activation='relu')(x)
    x = MaxPooling1D()(x)
    x = Conv1D(128, 3, activation='relu')(x)
    x = Conv1D(128, 3, activation='relu')(x)
    x = GlobalAveragePooling1D()(x)
    x = Dropout(0.5)(x)
    shape_output = Dense(1, activation='sigmoid')(x)
    model = Model(inputs, shape_output)
    if print_summary:
        print(model.summary())
    return model
Next, create an algorithm that passes all associated mass pair intensities into the created model. We could create a multilabel model, but for experimentation purposes I would like to create a dictionary of compound to model. I will compile each model using the Adam optimizer to update the graph weights based on the training data; the loss function will be binary cross-entropy because our output is theoretically binary. The output is a probability, so we need a threshold to decide whether to mark it as a true detection or no detection. We will use a ROC curve on the training data to determine the best threshold subject to having less than 2% false positives. I will need to stack the associated mass pair intensities in order to pass them into my created model.
from sklearn.model_selection import train_test_split
from keras.callbacks import ModelCheckpoint, EarlyStopping
from sklearn.metrics import roc_curve, roc_auc_score
#from sklearn.tree import DecisionTreeClassifier
import time
def build_train_model(mass_pair_ids_to_dataset):
mass_pair_ids_to_dataset_copy = mass_pair_ids_to_dataset.copy()
mass_pair_count = len(mass_pair_ids_to_dataset_copy)
model = build_compound_model(mass_pair_count, max_timestep, False)
model.compile(optimizer='adam',
loss='binary_crossentropy',
metrics=['accuracy'])
intensities_and_outputs = np.stack(list(mass_pair_ids_to_dataset_copy.values()), axis=1)
print(list(mass_pair_ids_to_dataset_copy.keys()))
#validate
#for i, (mass_pair_ids, dataset) in enumerate(mass_pair_ids_to_dataset.items()):
# if not np.array_equal(intensities_and_outputs[1,i], dataset.to_numpy()[1]):
# print("stack fail")
# break
X = intensities_and_outputs[:,:,:-1]
#normalize
def normalize(array):
for i, data in enumerate(array):
array[i] = np.nan_to_num(data/np.amax(data))
return array
X = normalize(X)
y = intensities_and_outputs[:,0,-1]#all of axis 1 will have same value
#convert y to binary form
y = (y >= 0).astype(int)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=7)
detected_count = sum(y_train)
non_detected_count = len(y_train) - detected_count
arange = np.arange(X.shape[-1])
fig = plt.figure()
plot_count = detected_count if detected_count < 8 else 8
detected_train_X = X_train[np.where(y_train > 0)]
mass_pairs = list(mass_pair_ids_to_dataset_copy.keys())
subplot = 241
for sample_index in range(plot_count):
ax = fig.add_subplot(subplot)
subplot += 1
for mass_pair_index, array in enumerate(detected_train_X[sample_index]):
lines = ax.plot(arange, array, label=mass_pairs[mass_pair_index])
ax.legend(loc='upper right')
plt.subplots_adjust(right=3, top=3)
plt.show()
print("Train on detected ", detected_count, " non-detected ", non_detected_count)
start = time.time()
early_stopping = EarlyStopping('val_acc', patience=10)
model.fit(X_train, y_train, epochs=1000, verbose=0,
validation_split=0.2, callbacks=[early_stopping])
end = time.time()
train_time = end - start
print("Train time in seconds: ", train_time)
y_hat = model.predict(X_train)
fprs, tprs, thresholds = roc_curve(y_train, y_hat)
#choose threshold where fpr <= 2% and tpr is the greatest
potential_indices = np.where(fprs <= 0.02)
max_tpr = max(tprs[potential_indices])
index = tprs.tolist().index(max_tpr)
threshold = float(thresholds[index])
#dt = DecisionTreeClassifier(random_state=0, max_depth=1)
#dt.fit(y_hat, y_test)
#threshold = dt.tree_.threshold[0]
#plot threshold
fig, ax = plt.subplots()
detection_indices = np.where(y_test > 0)
no_detection_indices = np.where(y_test == 0)
ax.scatter(y_hat[no_detection_indices], y_test[no_detection_indices], label="No Detection")
ax.scatter(y_hat[detection_indices], y_test[detection_indices], label="Detection")
ax.plot([threshold, threshold], [-1, 2], color='g')
ax.legend()
ax.grid(True)
plt.show()
#apply threshold
y_hat = (y_hat >= threshold).astype(int)
auc_score = roc_auc_score(y_train, y_hat)
predicted_tpr = tprs[index]
predicted_fpr = fprs[index]
print("AUC", auc_score)
print("Predicted TPR", predicted_tpr)
print("Predicted FPR", predicted_fpr)
print("Threshold", threshold)
print("-----------------------End of training---------------------------")
return model, (X_test, y_test), predicted_tpr, predicted_fpr, threshold, auc_score
from sklearn.metrics import accuracy_score, confusion_matrix, fbeta_score, classification_report, precision_recall_fscore_support, \
roc_curve, roc_auc_score
def test_model_and_output_results(model, testset, threshold, print_results=True):
X_test, y_test = testset
y_hat = model.predict(X_test)
#plot threshold
fig, ax = plt.subplots()
detection_indices = np.where(y_test > 0)
no_detection_indices = np.where(y_test == 0)
ax.scatter(y_hat[no_detection_indices], y_test[no_detection_indices], label="No Detection")
ax.scatter(y_hat[detection_indices], y_test[detection_indices], label="Detection")
ax.plot([threshold, threshold], [-1, 2], color='g')
ax.legend()
ax.grid(True)
plt.show()
#apply threshold
y_hat = (y_hat >= threshold).astype(int)
count = len(y_test)
detected_count = sum(y_test)
non_detected_count = count - detected_count
if print_results:
print("Test on detected ", detected_count, " non-detected ", non_detected_count)
print("F0.5 {0:.2%}".format(fbeta_score(y_test, y_hat, beta=0.5)))
print()
print("Confusion Matrix:")
tn, fp, fn, tp = confusion_matrix(y_test, y_hat).ravel()
fpr = fp/non_detected_count
tpr = tp/detected_count
if print_results:
print("True Negative {0:.2%}".format(tn/non_detected_count))
print("False Positive {0:.2%}".format(fpr))
print("False Negative {0:.2%}".format(fn/detected_count))
print("True Positive {0:.2%}".format(tpr))
print()
print("Classification Report")
print(classification_report(y_test, y_hat, target_names=['blank', 'detected'], digits=3))
print("---------------------------end of testing--------------------------------")
return tpr, fpr
def compound_id_results_to_dataframe(compound_id_to_results, columns=["Compounds", "Tr-TPR", "Tr-FPR", "Threshold", "ROC AUC Score", "Test-TPR", "Test-FPR"]):
table = []
for compound_id, results in compound_id_to_results.items():
row = []
for result in results:
if isinstance(result, (float, np.float32, np.float64)):
value = "{:.2%}".format(result)
else:
value = result
row.append(value)
row.insert(0, compound_id)
table.append(row)
return pd.DataFrame(table, columns=columns)
compound_id_to_results = dict()
for compound_id, mass_pair_ids_to_dataset in compound_ids_to_mass_pair_ids_to_dataset.items():
print()
print("Compound ID", compound_id)
model, testset, predicted_tpr, predicted_fpr, threshold, auc_score = build_train_model(mass_pair_ids_to_dataset)
tpr, fpr = test_model_and_output_results(model, testset, threshold)
compound_id_to_results[compound_id] = predicted_tpr, predicted_fpr, threshold, auc_score, tpr, fpr
original_results_dataframe = compound_id_results_to_dataframe(compound_id_to_results)
print(original_results_dataframe.to_string(index=False, justify='center'))
print()
The above results came out far better than I expected. The compounds I am skeptical of are 7 and 19: compound 19 trained on only 3 true detections and compound 7 on only 8 true samples, so these results probably need more detection samples before they can be trusted. I will set them aside for now. By plotting one sample per compound we can predict which models will do well. For example, compound 8 has two sets of intensities (the red and orange lines) that should probably be removed from its association, because they are very noisy and lack a single clear peak. Let's review our test data before making any further decisions on how to refine our data and/or algorithm.
One implementation challenge was how to normalize the data to produce trainable, consistent signals. Normalizing and scaling will remain a problem for us, especially once we must quantify how much of a compound was detected. Our instruments vary in sensitivity by some percentage, so it took some thought to choose a preprocessing scheme that lets a trained model work across instruments. I settled on min-max scaling against the largest intensity in the window: this scales every value between 0 and 1, which speeds up training while preserving the relative magnitudes of the mass pair intensities. My one worry is volatile intensities changing the overall shape of a sample. For example, if mass pair 46 shows a huge peak when it normally does not, every other signal in the compounds that use that mass pair is rescaled: the signal that is usually the largest now reaches only half its usual height because it is no longer the highest peak. This constraint makes the selection of mass pairs for each compound critical; it could be the difference between a 40% and a 90% detection rate.
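To make that volatility concern concrete, here is a small illustration on synthetic traces (not instrument data; the peak heights are made up) of how a spike in one mass pair rescales every other trace under shared-max scaling:

```python
import numpy as np

# Hypothetical two-channel sample: each row is one mass pair's intensity trace.
# Channel A carries the compound's characteristic peak; channel B is normally small.
timesteps = np.arange(100)
channel_a = 1000.0 * np.exp(-0.5 * ((timesteps - 50) / 5.0) ** 2)
channel_b_normal = 200.0 * np.exp(-0.5 * ((timesteps - 50) / 5.0) ** 2)
channel_b_volatile = 2000.0 * np.exp(-0.5 * ((timesteps - 50) / 5.0) ** 2)

def normalize_sample(sample):
    # Same scheme as normalize() above: divide the whole sample by its
    # single largest intensity, guarding against NaN from all-zero windows.
    return np.nan_to_num(sample / np.amax(sample))

normal = normalize_sample(np.stack([channel_a, channel_b_normal]))
volatile = normalize_sample(np.stack([channel_a, channel_b_volatile]))

print(normal[0].max())    # -> 1.0  (channel A spans the full range)
print(volatile[0].max())  # -> 0.5  (a spike elsewhere halves channel A's apparent peak)
```

The shape of channel A never changed; only the spike in channel B did, yet the model sees channel A at half its usual height.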
Compounds 0, 21, 22 currently meet our goal.
The rest need more data or further refinement to meet our goal.
Compound 7 and 19 do not have enough sample data so we will discard them for now.
I will go through each one and see whether removing some associated mass pair data that might be hurting our results improves anything. I also need to consider signal noise: if a mass pair intensity is too noisy, the model will not be able to learn its shape. I will probably need to apply some noise filtering per mass pair intensity to achieve consistent results. Let's try filtering first.
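As a quick preview of the Savitzky-Golay filter used in the code below, here is its effect on a synthetic noisy peak; the window length and polynomial order are illustrative, not tuned values:

```python
import numpy as np
from scipy.signal import savgol_filter

rng = np.random.default_rng(7)
t = np.arange(100)
clean = np.exp(-0.5 * ((t - 50) / 5.0) ** 2)       # idealized peak shape
noisy = clean + rng.normal(0, 0.05, size=t.shape)  # instrument-like noise

# Fit a cubic polynomial in a sliding 9-point window, as in the filtered models.
smoothed = savgol_filter(noisy, window_length=9, polyorder=3)

# Smoothing should pull the trace back toward the clean peak
# while keeping the peak location roughly where it was.
print(np.mean((smoothed - clean) ** 2) < np.mean((noisy - clean) ** 2))
print(int(np.argmax(smoothed)))
```

Because the filter fits a low-order polynomial locally rather than averaging blindly, it suppresses noise without flattening the peak shape the CNN needs to learn.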
from sklearn.model_selection import train_test_split
from keras.callbacks import ModelCheckpoint, EarlyStopping
from scipy.signal import savgol_filter
import time
def build_train_model_with_filter(mass_pair_ids_to_dataset, mass_pair_ids_to_window_size):
mass_pair_ids_to_dataset_copy = mass_pair_ids_to_dataset.copy()
mass_pair_count = len(mass_pair_ids_to_dataset_copy)
model = build_compound_model(mass_pair_count, max_timestep, False)
model.compile(optimizer='adam',
loss='binary_crossentropy',
metrics=['accuracy'])
def apply_filter(mass_pair_ids_to_dataset, mass_pair_ids_to_window_size):
dataset_list = []
for mass_pair_id, dataset in mass_pair_ids_to_dataset.items():
if mass_pair_id not in mass_pair_ids_to_window_size:
window_size = 0
else:
window_size = mass_pair_ids_to_window_size[mass_pair_id]
timesteps_and_output = dataset.to_numpy()
if window_size >= 5:
timesteps_and_output[:, :-1] = savgol_filter(timesteps_and_output[:, :-1], window_size, 3)
dataset_list.append(timesteps_and_output)
return dataset_list
intensities_and_outputs = np.stack(apply_filter(mass_pair_ids_to_dataset_copy, mass_pair_ids_to_window_size), axis=1)
print(list(mass_pair_ids_to_dataset_copy.keys()))
#validate
#for i, (mass_pair_ids, dataset) in enumerate(mass_pair_ids_to_dataset.items()):
# if not np.array_equal(intensities_and_outputs[1,i], dataset.to_numpy()[1]):
# print("stack fail")
# break
X = intensities_and_outputs[:,:,:-1]
#normalize
def normalize(array):
for i, data in enumerate(array):
array[i] = np.nan_to_num(data/np.amax(data))
return array
#preprocess
X = normalize(X)
y = intensities_and_outputs[:,0,-1]#all of axis 1 will have same value
#convert y to binary form
y = (y >= 0).astype(int)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=7)
detected_count = sum(y_train)
non_detected_count = len(y_train) - detected_count
arange = np.arange(X.shape[-1])
fig = plt.figure()
plot_count = detected_count if detected_count < 8 else 8
detected_train_X = X_train[np.where(y_train > 0)]
mass_pairs = list(mass_pair_ids_to_dataset_copy.keys())
subplot = 241
for sample_index in range(plot_count):
ax = fig.add_subplot(subplot)
subplot += 1
for mass_pair_index, array in enumerate(detected_train_X[sample_index]):
lines = ax.plot(arange, array, label=mass_pairs[mass_pair_index])
ax.legend(loc='upper right')
plt.subplots_adjust(right=3, top=3)
plt.show()
print("Train on detected ", detected_count, " non-detected ", non_detected_count)
start = time.time()
early_stopping = EarlyStopping('val_acc', patience=10)
model.fit(X_train, y_train, epochs=1000, verbose=0,
validation_split=0.2, callbacks=[early_stopping])
end = time.time()
train_time = end - start
print("Train time in seconds: ", train_time)
y_hat = model.predict(X_train)
fprs, tprs, thresholds = roc_curve(y_train, y_hat)
#choose threshold where fpr <= 2% and tpr is the greatest
potential_indices = np.where(fprs <= 0.02)
max_tpr = max(tprs[potential_indices])
index = tprs.tolist().index(max_tpr)
threshold = thresholds[index]
#dt = DecisionTreeClassifier(random_state=0, max_depth=1)
#dt.fit(y_hat, y_test)
#threshold = dt.tree_.threshold[0]
#plot threshold
fig, ax = plt.subplots()
detection_indices = np.where(y_test > 0)
no_detection_indices = np.where(y_test == 0)
ax.scatter(y_hat[no_detection_indices], y_test[no_detection_indices], label="No Detection")
ax.scatter(y_hat[detection_indices], y_test[detection_indices], label="Detection")
ax.plot([threshold, threshold], [-1, 2], color='g')
ax.legend()
ax.grid(True)
plt.show()
#apply threshold
y_hat = (y_hat >= threshold).astype(int)
auc_score = roc_auc_score(y_train, y_hat)
predicted_tpr = tprs[index]
predicted_fpr = fprs[index]
print("AUC", auc_score)
print("Predicted TPR", predicted_tpr)
print("Predicted FPR", predicted_fpr)
print("Threshold", threshold)
print("-----------------------End of training---------------------------")
return model, (X_test, y_test), predicted_tpr, predicted_fpr, threshold, auc_score
#Add filtering to original data
#initialize
mass_pair_ids_to_window_size = dict()
for mass_pair_id in range(max_mass_pair_count):
mass_pair_ids_to_window_size[mass_pair_id] = 5
compound_id_to_filtered_results = dict()
for compound_id, mass_pair_ids_to_dataset in compound_ids_to_mass_pair_ids_to_dataset.items():
print()
print("Compound ID", compound_id)
model, testset, predicted_tpr, predicted_fpr, threshold, auc_score = build_train_model_with_filter(mass_pair_ids_to_dataset, mass_pair_ids_to_window_size)
tpr, fpr = test_model_and_output_results(model, testset, threshold)
compound_id_to_filtered_results[compound_id] = predicted_tpr, predicted_fpr, threshold, auc_score, tpr, fpr
print("Original")
print(original_results_dataframe.to_string(index=False, justify='center'))
print()
print("Filtered")
filtered_results_dataframe = compound_id_results_to_dataframe(compound_id_to_filtered_results)
print(filtered_results_dataframe.to_string(index=False, justify='center'))
Interesting: compounds 4, 10, 13 and 21 do worse, while compounds 8, 14, and 15 do better. Mass pairs 16, 18, and 46 tend to be very volatile, with lots of apparent noise. I would be interested to see what happens if I apply filtering only to those mass pairs.
#search
#initialize
mass_pair_ids_to_window_size = dict()
for mass_pair_id in range(max_mass_pair_count):
mass_pair_ids_to_window_size[mass_pair_id] = 5
mass_pair_ids_to_window_size[16] = 9
mass_pair_ids_to_window_size[18] = 9
mass_pair_ids_to_window_size[46] = 9
# check compounds using above mass pairs
mini_compound_id_to_filtered_results = dict()
for compound_id in [8, 13, 15]:
mass_pair_ids_to_dataset = compound_ids_to_mass_pair_ids_to_dataset[compound_id]
print()
print("Compound ID", compound_id)
model, testset, predicted_tpr, predicted_fpr, threshold, auc_score = build_train_model_with_filter(mass_pair_ids_to_dataset, mass_pair_ids_to_window_size)
tpr, fpr = test_model_and_output_results(model, testset, threshold)
mini_compound_id_to_filtered_results[compound_id] = predicted_tpr, predicted_fpr, threshold, auc_score, tpr, fpr
mini_filtered_results_dataframe = compound_id_results_to_dataframe(mini_compound_id_to_filtered_results)
print(mini_filtered_results_dataframe.to_string(index=False, justify='center'))
Compound 8 was the only one that improved. I wonder whether removing the mass pairs that add noise to our graphs would stabilize our results. I will also try adding mass pairs that are shared across compounds to see whether that helps differentiate them; for example, compounds 8 and 15 share mass pairs. After adjusting the mass pairs, it may be worth revisiting mass pair filtering.
compound_id_to_adjusted_mass_pair_ids = compound_id_to_mass_pair_ids.copy()
if 7 in compound_id_to_adjusted_mass_pair_ids:
del compound_id_to_adjusted_mass_pair_ids[7]
if 19 in compound_id_to_adjusted_mass_pair_ids:
del compound_id_to_adjusted_mass_pair_ids[19]
compound_id_to_adjusted_mass_pair_ids[0] = [22, 23]
compound_id_to_adjusted_mass_pair_ids[3] = [47, 49, 50]
#4 same
compound_id_to_adjusted_mass_pair_ids[8] = [36, 39, 46]
compound_id_to_adjusted_mass_pair_ids[10] = [0, 3, 4]
compound_id_to_adjusted_mass_pair_ids[13] = [41, 42]
compound_id_to_adjusted_mass_pair_ids[14] = [16, 18]
compound_id_to_adjusted_mass_pair_ids[15] = [36, 39]
compound_id_to_adjusted_mass_pair_ids[18] = [21, 40]
compound_id_to_adjusted_mass_pair_ids[21] = [33, 34]
#22 same
display(compound_id_to_adjusted_mass_pair_ids)
compound_ids_to_adjusted_mass_pair_ids_to_dataset = get_compound_ids_to_mass_pair_ids_to_dataset(merged_data_set, compound_id_to_adjusted_mass_pair_ids)
compound_id_to_adjusted_results = dict()
for compound_id, mass_pair_ids_to_dataset in compound_ids_to_adjusted_mass_pair_ids_to_dataset.items():
print()
print("Compound ID", compound_id)
model, testset, predicted_tpr, predicted_fpr, threshold, auc_score = build_train_model(mass_pair_ids_to_dataset)
tpr, fpr = test_model_and_output_results(model, testset, threshold)
compound_id_to_adjusted_results[compound_id] = predicted_tpr, predicted_fpr, threshold, auc_score, tpr, fpr
adjusted_results_dataframe = compound_id_results_to_dataframe(compound_id_to_adjusted_results)
print(adjusted_results_dataframe.to_string(index=False, justify='center'))
Compound 14 seems to be very noisy and needs many more samples before we can distinguish a detection from a non-detection; I will need to consider removing this compound for now. Some of the compounds do better, others do worse. I will try adding filtering to see if we can get even better results.
#Add filtering to original data
#initialize
mass_pair_ids_to_window_size = dict()
for mass_pair_id in range(max_mass_pair_count):
mass_pair_ids_to_window_size[mass_pair_id] = 5
compound_id_to_adjusted_filtered_results = dict()
for compound_id, mass_pair_ids_to_dataset in compound_ids_to_adjusted_mass_pair_ids_to_dataset.items():
print()
print("Compound ID", compound_id)
model, testset, predicted_tpr, predicted_fpr, threshold, auc_score = build_train_model_with_filter(mass_pair_ids_to_dataset, mass_pair_ids_to_window_size)
tpr, fpr = test_model_and_output_results(model, testset, threshold)
compound_id_to_adjusted_filtered_results[compound_id] = predicted_tpr, predicted_fpr, threshold, auc_score, tpr, fpr
adjusted_filtered_results_dataframe = compound_id_results_to_dataframe(compound_id_to_adjusted_filtered_results)
print(adjusted_filtered_results_dataframe.to_string(index=False, justify='center'))
print("\tOriginal")
print(original_results_dataframe.to_string(index=False, justify='center'))
print()
print("\tFiltered")
print(filtered_results_dataframe.to_string(index=False, justify='center'))
print()
print("\tNo Filtering Adjusted")
print(adjusted_results_dataframe.to_string(index=False, justify='center'))
print()
print("\tFiltered Adjusted")
print(adjusted_filtered_results_dataframe.to_string(index=False, justify='center'))
The results are mixed: in some cases better, in others worse. Where the results match the original, I am inclined to take the adjusted version because it yields a simpler model. Most compounds perform better than or equal to baseline with intensity smoothing/filtering. The adjusted/filtered mass pairs can do much worse, meaning the removed or filtered mass pairs carry features that distinguish detections. We will need to tune the filtering and the adjusted mass pairs on a per-compound basis.
#From results choose whether to adjust or filter to get best model
def percent_to_float(x):
return float(x.strip('%'))/100
def get_roc_auc_score(dataframe, compound_id):
row = dataframe[dataframe['Compounds'] == compound_id]
if not row.empty:
values = row.to_numpy()[0]
return percent_to_float(values[4])
else:
return 0
mass_pair_ids_to_window_size = dict()
for mass_pair_id in range(max_mass_pair_count):
mass_pair_ids_to_window_size[mass_pair_id] = 5
compound_id_to_final_results = dict()
compound_id_to_final_model = dict()
for compound_id in compound_ids_to_mass_pair_ids_to_dataset.keys():
original_score = get_roc_auc_score(original_results_dataframe, compound_id)
filtered_score = get_roc_auc_score(filtered_results_dataframe, compound_id)
nfa_score = get_roc_auc_score(adjusted_results_dataframe, compound_id)
filtadj_score = get_roc_auc_score(adjusted_filtered_results_dataframe, compound_id)
scores = [filtadj_score, nfa_score, filtered_score, original_score]
choice = scores.index(max(scores))
labels = ["Filtered Adjusted", "Non-filtered Adjusted", "Filtered", "Original"]
abbrev_labels = ["Filt. Adj.", "Non-fil. Adj.", "Fil.", "Orig."]
print()
print("Compound ID", compound_id)
print(labels[choice])
if choice > 1:
#original
mass_pair_ids_to_dataset = compound_ids_to_mass_pair_ids_to_dataset[compound_id]
else:
#adjusted
mass_pair_ids_to_dataset = compound_ids_to_adjusted_mass_pair_ids_to_dataset[compound_id]
if choice % 2 == 1:#odd choices (1, 3) are unfiltered; even choices (0, 2) use the filter
model, testset, predicted_tpr, predicted_fpr, threshold, auc_score = build_train_model(mass_pair_ids_to_dataset)
else:
model, testset, predicted_tpr, predicted_fpr, threshold, auc_score = build_train_model_with_filter(mass_pair_ids_to_dataset, mass_pair_ids_to_window_size)
tpr, fpr = test_model_and_output_results(model, testset, threshold)
compound_id_to_final_results[compound_id] = predicted_tpr, predicted_fpr, threshold, auc_score, tpr, fpr, abbrev_labels[choice]
compound_id_to_final_model[compound_id] = model
final_results_dataframe = compound_id_results_to_dataframe(compound_id_to_final_results, ["Comp.", "TPR" ,"FPR", "Thresh.", "ROC Score", "Test-TPR", "Test-FPR", "Type"])
print(final_results_dataframe.to_string(index=False, justify='center'))
print()
We have not exactly reached our intended goal, but we have achieved some good results. I would like to see whether extending my neural network with additional features can increase prediction accuracy. This will also help me evaluate how good our model is.
from sklearn.model_selection import train_test_split
from keras.callbacks import ModelCheckpoint, EarlyStopping
from keras.models import Sequential, Model
from keras.layers import Dense, Conv1D, \
MaxPooling1D, GlobalAveragePooling1D, Dropout, \
Input, Permute, Concatenate, Flatten
import time
def build_compound_model_with_extra_features(mass_pair_count, max_timestep, print_summary=True):
shape_input = Input(shape=(mass_pair_count, max_timestep))
features_input = Input(shape=(mass_pair_count, 2))# width and position
x = Permute((2, 1))(shape_input)
x = Conv1D(128, 5, activation='relu')(x)
x = Conv1D(128, 5, activation='relu')(x)
x = MaxPooling1D()(x)
x = Dropout(0.2)(x)
x = Conv1D(128, 3, activation='relu')(x)
x = Conv1D(128, 3, activation='relu')(x)
x = GlobalAveragePooling1D()(x)
x = Dropout(0.2)(x)
shape_output = Dense(1, activation='relu')(x)
x2 = Dense(32, activation='relu')(features_input)
x2 = Dropout(0.4)(x2)
feature_output = Dense(1, activation='relu')(x2)
feature_output = Flatten()(feature_output)
merge = Concatenate()([shape_output, feature_output])
output = Dense(1, activation='sigmoid')(merge)
model = Model([shape_input, features_input], output)
if print_summary:
print(model.summary())
return model
def build_train_model_with_extra_features(mass_pair_ids_to_dataset, mass_pair_ids_to_features):
mass_pair_ids_to_dataset_copy = mass_pair_ids_to_dataset.copy()
mass_pair_ids_to_features_copy = mass_pair_ids_to_features.copy()
mass_pair_count = len(mass_pair_ids_to_dataset_copy)
model = build_compound_model_with_extra_features(mass_pair_count, max_timestep, False)
model.compile(optimizer='adam',
loss='binary_crossentropy',
metrics=['accuracy'])
intensities_and_outputs = np.stack(list(mass_pair_ids_to_dataset_copy.values()), axis=1)
features_and_outputs = np.stack(list(mass_pair_ids_to_features_copy.values()), axis=1)
print(list(mass_pair_ids_to_dataset_copy.keys()))
#validate
#for i, (mass_pair_ids, dataset) in enumerate(mass_pair_ids_to_dataset.items()):
# if not np.array_equal(intensities_and_outputs[1,i], dataset.to_numpy()[1]):
# print("stack fail")
# break
X = intensities_and_outputs[:,:,:-1]
#normalize
def normalize(array):
for i, data in enumerate(array):
array[i] = np.nan_to_num(data/np.amax(data))
return array
X = normalize(X)
X2 = features_and_outputs[:,:,:-1]
X2 = X2/max_timestep
y = intensities_and_outputs[:,0,-1]#all of axis 1 will have same value
#convert y to binary form
y = (y >= 0).astype(int)
X_train, X_test, X2_train, X2_test, y_train, y_test = train_test_split(X, X2, y, test_size=0.2, random_state=7)
detected_count = sum(y_train)
non_detected_count = len(y_train) - detected_count
arange = np.arange(X.shape[-1])
fig = plt.figure()
plot_count = detected_count if detected_count < 8 else 8
detected_train_X = X_train[np.where(y_train > 0)]
subplot = 241
for sample_index in range(plot_count):
ax = fig.add_subplot(subplot)
subplot += 1
for mass_pair_index, array in enumerate(detected_train_X[sample_index]):
ax.plot(arange, array)
plt.show()
print("Train on detected ", detected_count, " non-detected ", non_detected_count)
start = time.time()
early_stopping = EarlyStopping('val_acc', patience=10)
model.fit([X_train, X2_train], y_train, epochs=100, verbose=0,
validation_split=0.2, callbacks=[early_stopping])
end = time.time()
train_time = end - start
print("Train time in seconds: ", train_time)
y_hat = model.predict([X_train, X2_train])
fprs, tprs, thresholds = roc_curve(y_train, y_hat)
#choose threshold where fpr <= 2% and tpr is the greatest
potential_indices = np.where(fprs <= 0.02)
max_tpr = max(tprs[potential_indices])
index = tprs.tolist().index(max_tpr)
threshold = thresholds[index]
#dt = DecisionTreeClassifier(random_state=0, max_depth=1)
#dt.fit(y_hat, y_test)
#threshold = dt.tree_.threshold[0]
#plot threshold
fig, ax = plt.subplots()
detection_indices = np.where(y_test > 0)
no_detection_indices = np.where(y_test == 0)
ax.scatter(y_hat[no_detection_indices], y_test[no_detection_indices], label="No Detection")
ax.scatter(y_hat[detection_indices], y_test[detection_indices], label="Detection")
ax.plot([threshold, threshold], [-1, 2], color='g')
ax.legend()
ax.grid(True)
plt.show()
#apply threshold
y_hat = (y_hat >= threshold).astype(int)
auc_score = roc_auc_score(y_train, y_hat)
predicted_tpr = tprs[index]
predicted_fpr = fprs[index]
print("AUC", auc_score)
print("Predicted TPR", predicted_tpr)
print("Predicted FPR", predicted_fpr)
print("Threshold", threshold)
print("-----------------------End of training---------------------------")
return model, (X_test, X2_test, y_test), predicted_tpr, predicted_fpr, threshold, auc_score
from sklearn.metrics import accuracy_score, confusion_matrix, fbeta_score, classification_report, precision_recall_fscore_support
def test_extra_features_model_and_output_results(model, testset, threshold, print_results=True):
X_test, X2_test, y_test = testset
y_hat = model.predict([X_test, X2_test])
#plot threshold
fig, ax = plt.subplots()
detection_indices = np.where(y_test > 0)
no_detection_indices = np.where(y_test == 0)
ax.scatter(y_hat[no_detection_indices], y_test[no_detection_indices], label="No Detection")
ax.scatter(y_hat[detection_indices], y_test[detection_indices], label="Detection")
ax.plot([threshold, threshold], [-1, 2], color='g')
ax.legend()
ax.grid(True)
plt.show()
#apply threshold
y_hat = (y_hat >= threshold).astype(int)
count = len(y_test)
detected_count = sum(y_test)
non_detected_count = count - detected_count
if print_results:
print("Test on detected ", detected_count, " non-detected ", non_detected_count)
print("F0.5 {0:.2%}".format(fbeta_score(y_test, y_hat, beta=0.5)))
print()
print("Confusion Matrix:")
tn, fp, fn, tp = confusion_matrix(y_test, y_hat).ravel()
fpr = fp/non_detected_count
tpr = tp/detected_count
if print_results:
print("True Negative {0:.2%}".format(tn/non_detected_count))
print("False Positive {0:.2%}".format(fpr))
print("False Negative {0:.2%}".format(fn/detected_count))
print("True Positive {0:.2%}".format(tpr))
print()
print("Classification Report")
print(classification_report(y_test, y_hat, target_names=['blank', 'detected'], digits=3))
print("---------------------------end of testing--------------------------------")
return tpr, fpr
def get_compound_ids_to_mass_pair_ids_to_features(merged_data_set, compound_id_to_mass_pair_ids):
merged_data_set = merged_data_set.copy()
compounds_to_detection_data_set = merged_data_set.groupby(['detection']).groups
mass_pair_id_to_group_indices = merged_data_set.groupby(['mass_pair_id']).groups
feature_columns = [merged_data_set.columns[7], merged_data_set.columns[9]]#width and position
compound_ids_to_mass_pair_ids_to_dataset = dict()
for compounds, compound_detection_indices in sorted(compounds_to_detection_data_set.items()):
if compounds == 'None':
continue
else:
compound_ids = string_to_list_of_int(compounds)
for compound_id in compound_ids:
if compound_id not in compound_id_to_mass_pair_ids:
continue #skip
if compound_id not in compound_ids_to_mass_pair_ids_to_dataset:
compound_ids_to_mass_pair_ids_to_dataset[compound_id] = dict()
for mass_pair_id in compound_id_to_mass_pair_ids[compound_id]:
mass_pair_indices = mass_pair_id_to_group_indices[mass_pair_id]
#intersection of group indices
detection_indices = list(set(mass_pair_indices).intersection(set(compound_detection_indices)))
no_detection_indices = list(set(mass_pair_indices) - set(detection_indices))
detection_dataset = merged_data_set.iloc[detection_indices][feature_columns]
no_detection_dataset = merged_data_set.iloc[no_detection_indices][feature_columns]
detection_dataset = detection_dataset.assign(detection=compound_id)
no_detection_dataset = no_detection_dataset.assign(detection=-1)
dataset = pd.concat([detection_dataset, no_detection_dataset])
#print(set(dataset['detection']))
if mass_pair_id not in compound_ids_to_mass_pair_ids_to_dataset[compound_id]:
compound_ids_to_mass_pair_ids_to_dataset[compound_id][mass_pair_id] = dataset
else:
compound_ids_to_mass_pair_ids_to_dataset[compound_id][mass_pair_id] = pd.concat([compound_ids_to_mass_pair_ids_to_dataset[compound_id][mass_pair_id], dataset])
compound_ids_to_mass_pair_ids_to_dataset[compound_id] = dict(sorted(compound_ids_to_mass_pair_ids_to_dataset[compound_id].items()))
compound_ids_to_mass_pair_ids_to_dataset = dict(sorted(compound_ids_to_mass_pair_ids_to_dataset.items()))
return compound_ids_to_mass_pair_ids_to_dataset
compound_ids_to_mass_pair_ids_to_dataset = get_compound_ids_to_mass_pair_ids_to_dataset(merged_data_set, compound_id_to_mass_pair_ids)
compound_ids_to_mass_pair_ids_to_features = get_compound_ids_to_mass_pair_ids_to_features(merged_data_set, compound_id_to_mass_pair_ids)
print("Found compounds ", list(compound_ids_to_mass_pair_ids_to_dataset.keys()))
compound_id_to_extra_features_results = dict()
for compound_id, mass_pair_ids_to_dataset in compound_ids_to_mass_pair_ids_to_dataset.items():
mass_pair_ids_to_features = compound_ids_to_mass_pair_ids_to_features[compound_id]
print()
print("Compound ID", compound_id)
model, testset, predicted_tpr, predicted_fpr, threshold, auc_score = build_train_model_with_extra_features(mass_pair_ids_to_dataset, mass_pair_ids_to_features)
tpr, fpr = test_extra_features_model_and_output_results(model, testset, threshold)
compound_id_to_extra_features_results[compound_id] = predicted_tpr, predicted_fpr, threshold, auc_score, tpr, fpr
extra_features_results_dataframe = compound_id_results_to_dataframe(compound_id_to_extra_features_results)
print(extra_features_results_dataframe.to_string(index=False, justify='center'))
print()
Attempting to incorporate width and position into the network produced much worse results. It seems the best option is to proceed with the previously refined model. I do believe the refined model could do even better if we had a larger sample set.
def get_compound_ids_to_substrate_to_masspair_to_dataset(merged_data_set, compound_id_to_mass_pair_ids):
merged_data_set = merged_data_set.copy()
compounds_to_detection_data_set = merged_data_set.groupby(['detection']).groups
substrate_to_group_indices = merged_data_set.groupby(['substrate']).groups
mass_pair_id_to_group_indices = merged_data_set.groupby(['mass_pair_id']).groups
timestep_columns = merged_data_set.columns[10:]
compound_ids_to_substrate_to_masspair_to_dataset = dict()
for compounds, compound_detection_indices in sorted(compounds_to_detection_data_set.items()):
if compounds == 'None':
continue
else:
compound_ids = string_to_list_of_int(compounds)
for compound_id in compound_ids:
if compound_id not in compound_id_to_mass_pair_ids:
continue #skip
if compound_id not in compound_ids_to_substrate_to_masspair_to_dataset:
compound_ids_to_substrate_to_masspair_to_dataset[compound_id] = dict()
for substrate, substrate_indices in substrate_to_group_indices.items():
#intersection of group indices
detection_substrate_indices = list(set(substrate_indices).intersection(set(compound_detection_indices)))
if substrate not in compound_ids_to_substrate_to_masspair_to_dataset[compound_id]:
compound_ids_to_substrate_to_masspair_to_dataset[compound_id][substrate] = dict()
for mass_pair_id in compound_id_to_mass_pair_ids[compound_id]:
mass_pair_indices = mass_pair_id_to_group_indices[mass_pair_id]
detection_substrate_mass_pair_indices = list(set(detection_substrate_indices).intersection(set(mass_pair_indices)))
detection_dataset = merged_data_set.iloc[detection_substrate_mass_pair_indices][timestep_columns]
detection_dataset = detection_dataset.assign(detection=compound_id)
#print(set(dataset['detection']))
if mass_pair_id not in compound_ids_to_substrate_to_masspair_to_dataset[compound_id][substrate]:
compound_ids_to_substrate_to_masspair_to_dataset[compound_id][substrate][mass_pair_id] = detection_dataset
else:
compound_ids_to_substrate_to_masspair_to_dataset[compound_id][substrate][mass_pair_id] = pd.concat([compound_ids_to_substrate_to_masspair_to_dataset[compound_id][substrate][mass_pair_id], detection_dataset])
#sort mass pairs
compound_ids_to_substrate_to_masspair_to_dataset[compound_id][substrate] = dict(sorted(compound_ids_to_substrate_to_masspair_to_dataset[compound_id][substrate].items()))
#sort substrates
compound_ids_to_substrate_to_masspair_to_dataset[compound_id] = dict(sorted(compound_ids_to_substrate_to_masspair_to_dataset[compound_id].items()))
compound_ids_to_substrate_to_masspair_to_dataset = dict(sorted(compound_ids_to_substrate_to_masspair_to_dataset.items()))
return compound_ids_to_substrate_to_masspair_to_dataset
compound_ids_to_substrate_to_masspair_to_dataset = get_compound_ids_to_substrate_to_masspair_to_dataset(merged_data_set, compound_id_to_mass_pair_ids)
for compound_id, substrate_to_masspair_to_dataset in compound_ids_to_substrate_to_masspair_to_dataset.items():
    print("Compound", compound_id)
    for substrate, masspair_to_dataset in substrate_to_masspair_to_dataset.items():
        if substrate == "Blank":
            continue
        dataset = list(masspair_to_dataset.values())[0]
        print("Substrate", substrate, "# of samples", len(dataset))
Above is a breakdown of all of the detection samples by substrate. We know that the shapes of the lines change depending on the substrate and the tested compound. According to this data, we should perform best on the compounds that have a large number of samples within each substrate. For example, compounds 0 and 4 should perform well because their samples fall entirely within the "No substrate" category, so there are many good samples to train, validate, and test on. Our model struggled on compound 8, giving an average accuracy between ~27% and 54%, and the distribution of the data suggests why: this may be the kind of compound that needs more than 10 samples per category to train a good model, so depending on how the data is shuffled into train, validation, and test sets, our model may not do very well. The same reasoning applies to compounds 13 and 18.
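One way to reduce the shuffle sensitivity described above would be a stratified split, which preserves the class ratio in every partition. This is only a sketch with made-up labels, not what the notebook currently does:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical labels: 10 detections and 30 non-detections, mimicking a
# compound with few positive samples per category.
y = np.array([1] * 10 + [0] * 30)
X = np.arange(len(y)).reshape(-1, 1)  # placeholder features

# stratify=y keeps the detection/non-detection ratio identical in both
# splits, so an unlucky shuffle cannot starve the training set of positives.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=7, stratify=y)

print(int(y_train.sum()), int(y_test.sum()))  # 8 positives train, 2 test
```

With `stratify=y`, every reshuffle yields the same 8/2 positive split, which would make results on low-sample compounds like compound 8 less dependent on the random seed.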
print(final_results_dataframe.to_string(index=False, justify='center'))
If we exclude the results for compounds 7, 14, and 19, six of the remaining ten compounds meet our benchmark of a 90% or higher true positive rate with a false positive rate below 2%. Compounds 4 and 13 are close and may just need more sample data to reach the benchmark. I expected compounds 4 and 18 to exceed the benchmark as well, but for some reason they have not. One change I would make is to train a separate model for the samples whose detection is the [21, 0, 18, 4] compound set, rather than splitting those samples across their respective models. This kind of sample is known as a confidence check: a premade solution used to verify that the system is calibrated for use. I have noticed that the shapes of our mass pair lines change slightly between a confidence check and a standalone test, which can be observed in compounds 18 and 21. Both of these compounds might perform even better without the confidence check data, and a dedicated confidence check model would likely give our CNN more features to better determine a detection. As previously stated, the compounds that performed poorly did not have enough sample data to train on. Given how well some compounds did, I believe the model will work well; with additional samples and refinement, I could meet my intended benchmark.
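For reference, the true positive rate and false positive rate thresholds cited above can be computed directly from binary labels and predictions. The labels here are fabricated for illustration; this is not the notebook's evaluation code:

```python
import numpy as np

def true_false_positive_rates(y_true, y_pred):
    """Compute TPR (recall on detections) and FPR (false-alarm rate)."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    tpr = tp / (tp + fn) if (tp + fn) else 0.0
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    return tpr, fpr

# Toy example: 10 true detections, 9 caught; 100 blanks, 1 false alarm.
y_true = [1] * 10 + [0] * 100
y_pred = [1] * 9 + [0] * 1 + [1] * 1 + [0] * 99
tpr, fpr = true_false_positive_rates(y_true, y_pred)
print(tpr, fpr)  # 0.9 0.01
```

A compound passing the benchmark would show tpr >= 0.90 and fpr < 0.02 on its held-out test set.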
I believe the models I have created are trustworthy. I trained on 60% of my data, validated on 20%, and tested on the final 20%, so the test results you see come from data my models have never seen. The models also show some robustness: if you look at a subset of our sample data by compound, you can see patterns, but they are not identical. Filtering helps in cases where the CNN cannot account for the variability in the signal. Provided enough sample data, the models are quite good at learning how the signal patterns vary by substrate; the models for compounds 10 and 15 validate this. Overall, our models are quite robust given the right amount of training data, filtering, and mass pair selection.
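The 60/20/20 split described above can be produced with two successive `train_test_split` calls; a minimal sketch with placeholder data (the notebook's actual split code is elsewhere):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(100).reshape(-1, 1)  # placeholder features
y = np.zeros(100, dtype=int)       # placeholder labels

# First carve off the 20% test set, then split the remaining 80% into
# 60% train / 20% validation (0.25 of the remainder is 20% overall).
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, random_state=7)
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=7)

print(len(X_train), len(X_val), len(X_test))  # 60 20 20
```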
def plot_visuals(mass_pair_ids_to_dataset):
    mass_pair_ids_to_dataset_copy = mass_pair_ids_to_dataset.copy()
    intensities_and_outputs = np.stack(list(mass_pair_ids_to_dataset_copy.values()), axis=1)
    # validate the stacking
    # for i, (mass_pair_ids, dataset) in enumerate(mass_pair_ids_to_dataset.items()):
    #     if not np.array_equal(intensities_and_outputs[1, i], dataset.to_numpy()[1]):
    #         print("stack fail")
    #         break
    X = intensities_and_outputs[:, :, :-1]

    def normalize(array):
        # scale each sample to [0, 1] by its maximum intensity
        for i, data in enumerate(array):
            array[i] = np.nan_to_num(data / np.amax(data))
        return array

    X = normalize(X)
    y = intensities_and_outputs[:, 0, -1]  # all of axis 1 will have the same value
    # convert y to binary form
    y = (y >= 0).astype(int)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=7)
    detected_count = sum(y_train)
    non_detected_count = len(y_train) - detected_count
    arange = np.arange(X.shape[-1])
    fig = plt.figure()
    plot_count = detected_count if detected_count < 8 else 8
    detected_train_X = X_train[np.where(y_train > 0)]
    mass_pairs = list(mass_pair_ids_to_dataset_copy.keys())
    subplot = 241
    for sample_index in range(plot_count):
        ax = fig.add_subplot(subplot)
        subplot += 1
        for mass_pair_index, array in enumerate(detected_train_X[sample_index]):
            lines = ax.plot(arange, array, label=mass_pairs[mass_pair_index])
        ax.legend(loc='upper right')
    plt.subplots_adjust(right=3, top=3)
    plt.show()
for compound_id, substrate_to_masspair_to_dataset in compound_ids_to_substrate_to_masspair_to_dataset.items():
    print("Compound", compound_id)
    for substrate, masspair_to_dataset in substrate_to_masspair_to_dataset.items():
        if substrate == "Blank" or len(list(masspair_to_dataset.values())[0]) == 0:
            continue
        print("Substrate", substrate)
        plot_visuals(masspair_to_dataset)
Above is a visualization of what the graphs look like per compound and per relevant mass pair. In some cases the graphs look very similar; in others, there is a mass pair line that is very noisy and would be better off removed. Mass pair 46, for example, is very noisy in most cases (compound 8) but has a consistent look in others (compound 13). We should further explore whether filtering mass pair lines on a per-compound basis would yield even better results. The consistency we see in the data also explains why a CNN performs so well on this dataset.
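One simple way the per-compound filtering idea could work is to drop any mass pair whose intensity lines are too noisy before stacking. The datasets, noise statistic, and threshold below are all illustrative assumptions, not the notebook's code:

```python
import numpy as np
import pandas as pd

# Hypothetical per-mass-pair datasets keyed by mass pair id, mirroring the
# masspair_to_dataset dicts above (5 samples x 20 timesteps each).
rng = np.random.default_rng(0)
masspair_to_dataset = {
    28: pd.DataFrame(rng.normal(10, 0.5, size=(5, 20))),  # clean line
    46: pd.DataFrame(rng.normal(10, 8.0, size=(5, 20))),  # noisy line
}

# Drop any mass pair whose mean per-sample standard deviation exceeds a
# threshold, so it cannot mislead the CNN. The threshold is made up here.
NOISE_THRESHOLD = 3.0
filtered = {
    mp: df for mp, df in masspair_to_dataset.items()
    if df.std(axis=1).mean() <= NOISE_THRESHOLD
}
print(sorted(filtered))  # the noisy mass pair 46 is removed
```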
Before starting this project, our scientists would scratch their heads and say, "I don't understand why our algorithms are not detecting the compounds. If I can see it, the algorithms should be able to detect it." We previously used the flawed approach assessed in our exploratory data analysis section, where we tried to find a peak and characterize it by width, height, position, and area. This approach does not account for noise very well, which creates all kinds of problems, such as inaccurate peak characterization and poor peak selection when there are multiple peaks to choose from.
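The failure mode of the old peak-finding approach is easy to reproduce. With a synthetic Gaussian peak (not our real data), a noiseless signal yields exactly one peak, while the same signal with noise produces spurious local maxima that a width/height/position/area characterizer must then choose between:

```python
import numpy as np
from scipy.signal import find_peaks

# A clean synthetic peak versus the same peak with additive noise.
x = np.arange(100)
clean = np.exp(-0.5 * ((x - 50) / 5.0) ** 2)
rng = np.random.default_rng(1)
noisy = clean + rng.normal(0, 0.15, size=x.size)

# Threshold-based peak finding: unambiguous on the clean signal, but noise
# can fragment one physical peak into several candidate maxima.
clean_peaks, _ = find_peaks(clean, height=0.5)
noisy_peaks, _ = find_peaks(noisy, height=0.5)
print(len(clean_peaks), len(noisy_peaks))
```

A CNN sidesteps this by learning the overall line shape instead of committing to one hand-picked peak.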
After exploring the data, I knew I would have to use a CNN to get the best results. The first challenge I ran into was deciding how many hidden nodes to use in my CNN. There is a tradeoff between training speed and the amount learned, and there is currently no rule of thumb, nor feedback from a trained model, on how many nodes you need other than experimentation. I settled on my current model purely by doing a manual parameter grid search, using my results and training time as the metrics to find the optimal model. Even if I automated it with a grid search, I still find it troubling that determining the optimal model is essentially a trial-and-error process.
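The manual search described above amounts to a loop over candidate layer sizes, scoring each on held-out data and timing its training. The sketch below uses scikit-learn's MLPClassifier on synthetic data purely as a stand-in; the real models are CNNs and the layer sizes searched here are illustrative:

```python
import time
import numpy as np
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split

# Stand-in data; in the real notebook this would be the stacked mass pair
# intensities and detection labels.
X, y = make_classification(n_samples=200, n_features=20, random_state=7)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=7)

results = {}
for hidden in [(8,), (32,), (128,)]:
    start = time.perf_counter()
    model = MLPClassifier(hidden_layer_sizes=hidden, max_iter=500, random_state=7)
    model.fit(X_train, y_train)
    elapsed = time.perf_counter() - start
    # Record validation accuracy alongside training time so the
    # accuracy-vs-speed tradeoff is explicit when picking a size.
    results[hidden] = (model.score(X_val, y_val), elapsed)

best = max(results, key=lambda h: results[h][0])
print(best, results[best])
```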
One area I found interesting was assessing how filtering would affect the detection outcome of my CNN. I wanted to use a grid search starting at 5 and going to 23 in increments of 2 or 4, but after implementing and testing it, the grid search took far too long and gave mixed results. Using my GPU it would take an hour, and on my CPU it would theoretically take 5 hours. I realize I am complaining about time when some models take a day or more to train, but it made me realize that machine learning is not a fast process. In a fast-paced environment, you could struggle to keep up simply because most of your time is spent waiting for trial-and-error procedures to finish so you can assess the results.
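The 5-to-23 grid above presumably indexes a smoothing-filter window size. As a cheap stand-in for the full CNN retraining loop, the sketch below scores a moving-average filter (an assumed filter type, not necessarily the one used) at each window against a known clean signal:

```python
import numpy as np

def moving_average(signal, window):
    """Smooth a 1-D signal with a centered moving average."""
    kernel = np.ones(window) / window
    return np.convolve(signal, kernel, mode='same')

# Synthetic peak plus noise standing in for a mass pair intensity line.
rng = np.random.default_rng(3)
x = np.arange(200)
clean = np.exp(-0.5 * ((x - 100) / 10.0) ** 2)
noisy = clean + rng.normal(0, 0.2, size=x.size)

# Window grid mirrors the 5-to-23-in-steps-of-2 search described above;
# here "best" simply means lowest error against the known clean signal.
errors = {}
for window in range(5, 24, 2):
    smoothed = moving_average(noisy, window)
    errors[window] = float(np.mean((smoothed - clean) ** 2))

best_window = min(errors, key=errors.get)
print(best_window, errors[best_window])
```

A proxy metric like this runs in milliseconds per candidate, which could narrow the grid before committing GPU-hours to retraining the CNN at each window.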
By using a CNN, we can learn the characteristics of our mass pair lines to better decide whether we have a detected compound or just chemical noise. Even with a CNN we must still account for noisy lines that can prevent it from working well, but it is now much easier to filter or remove a line and see how that impacts the CNN's learning. At least now our detection capabilities can match our scientists' detection abilities.
The improvements I have alluded to all revolve around parameterized grid searching. I could grid search the model's parameters to find the optimal model for both results and training time, and I could grid search filtering to find the best filter on a per-compound, per-mass-pair basis. From the results of the filtering and/or mass pair removal, I could eventually develop an algorithm that decides whether to remove, filter, or leave alone the selected mass pair intensities. This would greatly improve our models' learning, especially when adding new compounds to our detection library.
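The remove/filter/leave-alone decision proposed above could start from something as simple as a per-line signal-to-noise estimate. Everything below is an illustrative assumption: the SNR proxy, the thresholds, and the synthetic inputs are all made up, not derived from our data:

```python
import numpy as np

def decide_action(intensities, keep_snr=10.0, filter_snr=3.0):
    """Sketch of a per-mass-pair decision: estimate a crude SNR and choose
    'keep', 'filter', or 'remove'. Thresholds are illustrative."""
    intensities = np.asarray(intensities, dtype=float)
    signal = np.max(intensities) - np.median(intensities)
    noise = np.std(np.diff(intensities)) + 1e-12  # high-frequency noise proxy
    snr = signal / noise
    if snr >= keep_snr:
        return 'keep'
    if snr >= filter_snr:
        return 'filter'
    return 'remove'

# Synthetic inputs: a strong clean peak versus a pure-noise line.
x = np.arange(100)
clean_peak = np.exp(-0.5 * ((x - 50) / 5.0) ** 2)
rng = np.random.default_rng(5)
noise_line = rng.normal(0, 1.0, size=100)

print(decide_action(clean_peak))
print(decide_action(noise_line))
```

Lines landing in the middle band would get a smoothing filter rather than removal, preserving marginal mass pairs as model features.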